Getting an LLM to return a JSON object that matches a schema is one of those tasks that looks trivial until you push it to production at thousands of documents per hour. As soon as the prompt gets complex, the model starts making the usual mistakes: a missing comma, a brace that closes too early, a friendly preamble of the “here’s the JSON you asked for” variety, a Markdown fence wrapping the answer, or simply a field that was not in the schema. With careful prompting and low temperature you can reach 98% success on a large model, but that remaining 2%, multiplied by volume, ends up eating more engineering time than anyone is willing to admit.
Constrained decoding solves the problem by construction rather than by post-hoc validation. The idea is simple and elegant. At every generation step, the model produces a probability distribution over the entire vocabulary, typically around fifty thousand tokens. Instead of sampling from that distribution as-is, we apply a mask: tokens that cannot extend a valid string under the target grammar are zeroed out, and we then renormalise and sample only over the legal subset. The output complies with the grammar as a mathematical guarantee, not a statistical one. No branch of the generation tree can produce broken JSON because the missing brace is simply not a reachable token.
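The masking step is small enough to sketch directly. This is a toy with a six-token vocabulary and a hardcoded legality check standing in for the compiled grammar automaton that real libraries build; the state name and the allow-list are illustrative, not any library's API:

```python
import math

# Toy vocabulary and the model's raw scores for this generation step.
vocab = ['{', '}', '"name"', ':', ',', 'hello']
logits = [1.2, 0.3, 2.5, 0.1, 0.4, 3.0]

def legal(token, state):
    # Hypothetical grammar check: right after '{', only a key or '}' may follow.
    if state == 'after_open_brace':
        return token in ('"name"', '}')
    return True

state = 'after_open_brace'

# Mask: illegal continuations get -inf, so softmax sends them to zero.
masked = [l if legal(t, state) else float('-inf') for t, l in zip(vocab, logits)]

# Renormalise over the legal subset (numerically stable softmax).
m = max(masked)
exps = [math.exp(l - m) for l in masked]
total = sum(exps)
probs = [e / total for e in exps]
# 'hello' had the highest raw score, yet it is unreachable: its probability is 0.
```

Note that the model still ranks the legal options; the grammar only decides which options exist.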
Three families of tools dominate the conversation today. Outlines is the Python reference implementation: it works with Hugging Face Transformers, llama.cpp and vLLM, and lets you express constraints as a Pydantic model, a JSON Schema, a regular expression or a context-free grammar in Lark syntax. Microsoft’s Guidance takes a more template-oriented approach: you interleave fixed text with constrained-generation regions (a select between options, a gen with a regex, a JSON block) and the runtime handles the alternation. It shines when the output is not one structured blob but a conversation with partial structure. Instructor, finally, is not strictly a constrained decoder: it wraps the OpenAI or Anthropic client and uses Pydantic to validate and retry, but it is popular enough to belong in the same conversation because it addresses the same pain with a different philosophy. With Outlines, the whole setup fits in a few lines:
```python
from outlines import models, generate
from pydantic import BaseModel

# The Pydantic model doubles as the grammar: Outlines compiles it
# into a mask applied at every decoding step.
class Invoice(BaseModel):
    number: str
    total: float
    lines: list[str]

model = models.transformers("meta-llama/Llama-3-8B-Instruct")
generator = generate.json(model, Invoice)
invoice = generator("Extract data from: Invoice A-0012, total 128.50, two lines")
```
It helps to be clear about what you are competing against. The classic path is the retry loop: ask the model for JSON, try to parse it, and if parsing fails, call the model again with the error as context so it can self-correct. It works reasonably well and is what powers Instructor under the hood, but every retry is a full call to the model, with its own latency and cost. In bulk-extraction workloads the time lost to retries dominates the budget. Constrained decoding instead pays its tax inside the same forward pass: each token is 10% to 30% slower because of the masking step, but there is never a second round. For batches of thousands of documents the arithmetic clearly favours it.
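The baseline is easy to sketch. Here `call_model` is a hypothetical stand-in for a real chat-completion call, exercised below with a fake model that fails once and then corrects itself:

```python
import json

def extract_with_retries(prompt, call_model, max_retries=3):
    """Ask for JSON; on a parse failure, retry with the error as context."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_retries):
        reply = call_model(messages)
        try:
            return json.loads(reply)
        except json.JSONDecodeError as err:
            # Every retry is a full extra model call: the hidden cost this
            # loop pays that constrained decoding does not.
            messages.append({"role": "assistant", "content": reply})
            messages.append({
                "role": "user",
                "content": f"That was invalid JSON ({err}). Return only the corrected JSON.",
            })
    raise ValueError(f"no valid JSON after {max_retries} attempts")

# Fake model: first reply is a truncated object, the second is valid.
replies = iter(['{"total": 128.50,', '{"total": 128.5}'])
result = extract_with_retries("Extract the total", lambda messages: next(replies))
```

Each trip through the except branch doubles the latency of that document, which is exactly the arithmetic the paragraph above is pointing at.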
The second alternative is OpenAI’s function calling, or the JSON mode introduced in November 2023. JSON mode guarantees that the output parses as JSON, but not that it matches your schema: you can receive a syntactically valid object that has invented fields or silently changed types. Function calling goes one step further because it accepts a schema, but in practice drift remains: optional fields get dropped, enum values are not always respected, and numeric types turn up as strings. OpenAI has announced it is working on a Structured Outputs mode with full schema guarantees, but it is not available yet. For self-hosted models, Outlines and Guidance are plainly superior because the guarantee holds by construction.
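The gap is easy to demonstrate: a response that fully satisfies JSON mode’s guarantee (it parses) while still violating the schema. The checker below is a hand-rolled stand-in for whatever validation your application layer does; the field names are illustrative:

```python
import json

# Parses fine: that is all JSON mode promises. But 'total' came back as a
# string and 'currency' was invented.
reply = '{"number": "A-0012", "total": "128.50", "currency": "EUR"}'
data = json.loads(reply)

schema_fields = {"number": str, "total": float}
problems = []
for field, typ in schema_fields.items():
    if field not in data:
        problems.append(f"missing: {field}")
    elif not isinstance(data[field], typ):
        problems.append(f"wrong type: {field}")
problems += [f"invented: {k}" for k in data if k not in schema_fields]
# problems now flags the stringified number and the extra field.
```

With constrained decoding, this checker would have nothing to catch, because the grammar forbids both failure modes at generation time.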
Where it is worth adopting depends a lot on the use case. In large-scale structured extraction —invoices, resumes, clinical results, contracts— the difference between 98% and 100% valid output is enormous because it removes the error queue that otherwise needs to be reviewed or reprocessed by hand. In agents that emit tool calls, guaranteeing that the argument JSON is always parseable prevents a formatting error from tearing down an entire chain of reasoning. In synthetic data generation with a fixed schema, the guarantee lets you chain steps without intermediate safety nets. Conversely, for conversational chat with semi-free outputs, or when a frontier model with well-written prompts already delivers over 99%, the integration cost will not amortise.
Two misconceptions deserve pushing back on. The first is thinking that constrained decoding improves the semantic quality of the answer: it does not. The JSON will be valid and match the schema, but if the model misunderstood the document, the extracted values will still be wrong. The guarantee is syntactic, not about content. The second is assuming integration is trivial: moving from the OpenAI API to a local runtime with Outlines means managing GPUs, model versions, KV caches and the whole inference-serving apparatus. For small teams without their own ML platform, Instructor on top of OpenAI remains the pragmatic path even if you give up the formal guarantee.
The near horizon makes the direction clear: vLLM v0.4 integrates Outlines natively, llama.cpp has exposed --grammar for months, and TGI ships its own Guidance-based variant. Inference runtimes are treating constrained decoding as a first-class citizen. The message for anyone building on LLMs in 2024 is that format validation should move out of application code and into the generation layer, because that is where the marginal cost is lowest and the guarantee strongest. Patching over the problem with retries and json.loads wrapped in try/except is a form of technical debt that ages poorly.
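As a taste of that runtime-level support, here is a minimal sketch of a llama.cpp GBNF grammar constraining output to a flat invoice object. The field names are illustrative; the rule syntax follows the conventions of llama.cpp’s bundled grammars/json.gbnf:

```
root   ::= "{" ws "\"number\"" ws ":" ws string ","
               ws "\"total\""  ws ":" ws number ws "}"
string ::= "\"" [a-zA-Z0-9 .-]* "\""
number ::= [0-9]+ ("." [0-9]+)?
ws     ::= [ \t\n]*
```

Passed to llama.cpp via --grammar (or --grammar-file), this drives exactly the token masking described at the start: no application-side validation required.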